Good day. I’m roleplaying this exercise as a consultant hired by Open University officials, called in to analyze the problem of student churn based on demographic, performance and behavioural data. This R Markdown document was created for the OU Analytics team’s review only. If you would like to fork the code, make sure you’ve installed all the packages in the libraries section. The source data can be found here: https://analyse.kmi.open.ac.uk/open_dataset.
#libraries
library(plotly)
library(data.table)
library(dplyr)
library(e1071)
library(ggplot2)
library(caret)
library(ggthemes)
library(rpart)
library(randomForest)
#setting path
path <- '~/Documents/Masters/Data_science/adobe_challenge/'
source(paste(path, 'src/load.src.r', sep=''))
load.src(paste(path, 'src/', sep = ''))
#Setting seed for session for reproducible results
set.seed(145)
This dataset contains the demographic data and final_result of each student who attended the courses. Below you will find the number of unique levels for each categorical variable.
Based on the information above, we can frame our goal:
Expected Result: Reduction in failure and withdrawal rates across all modules!
#students information
student_info <- fread(paste(path,'data/',"studentInfo.csv",sep = ''),
stringsAsFactors = TRUE) %>%
dplyr::mutate(id_student = as.factor(id_student),
imd_band = as.factor(imd_band),
num_of_prev_attempts = as.integer(as.character(num_of_prev_attempts)),
studied_credits = as.integer(as.character(studied_credits)),
final_result = as.factor(final_result))
nrow(student_info)
## [1] 32593
apply(student_info, 2, function(x) length(unique(x)))
## code_module code_presentation id_student
## 7 4 28785
## gender region highest_education
## 2 13 5
## imd_band age_band num_of_prev_attempts
## 11 3 7
## studied_credits disability final_result
## 61 2 4
Quick fact: 3808 students have retaken courses!
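As a sanity check, the retake count can be reproduced from num_of_prev_attempts. This is an illustrative sketch only: the toy frame below stands in for the real student_info loaded later from studentInfo.csv.

```r
library(dplyr)

# Toy stand-in for student_info (the real data is loaded from studentInfo.csv)
toy_info <- data.frame(
  id_student = factor(1:4),
  num_of_prev_attempts = c(0L, 2L, 0L, 1L)
)

# A "retaker" is any student with at least one previous attempt
n_retakers <- toy_info %>%
  filter(num_of_prev_attempts > 0) %>%
  nrow()

n_retakers  # 2 retakers in this toy frame
```

On the full student_info, the same filter yields the 3808 figure quoted above.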
To find a viable problem statement, we need to dig into the data! Question 1: What is our problem right now?
Answer: I’ve grouped the final_result of each student against the course’s total strength in the following code,
#group course strength to define our problem!
result_info <- student_info %>%
group_by(final_result) %>%
dplyr::summarise(module_strength = n()) %>%
dplyr::mutate(module_percentage = paste(round((module_strength/sum(module_strength))*100,digits = 2), "%"))
#plotly
plot_ly(result_info,
x = ~final_result, y = ~module_strength, type = 'bar', hoverinfo = 'text',
color = ~final_result,
text = ~paste(result_info$module_percentage, 'students', final_result)
) %>%
layout(title = "Overall Result distribution", showlegend = TRUE,
font = list(family = "sans-serif", size = 12,color = "black"),
xaxis = list(title = "Final Results"),
yaxis = list(title = "Number of Students"))
The students’ performance has been recorded as a score on each assessment; this is the key performance feature for our analysis.
The students’ learning behaviour is recorded per type of material available; this captures the learning pattern of each individual student.
Question 2: Which dimensions have the highest failure or withdrawal rates?
For this question, we’ll start by exploring the student_info dataframe, as it holds the answer!
#failed students
#group course strength to define our problem!
churn_rate_per_module <- student_info %>%
group_by(code_module, final_result) %>%
dplyr::summarise(module_strength = n()) %>%
dplyr::mutate(module_percentage = paste(round((module_strength/sum(module_strength))*100,digits = 2), "%"))
#plotly
plot_ly(churn_rate_per_module,
x = ~code_module, y = ~module_strength, type = 'bar', hoverinfo = 'text',
color = ~final_result,
text = ~paste(churn_rate_per_module$module_percentage, 'students', final_result)
) %>%
layout(title = "Overall Result distribution across modules", showlegend = TRUE,
font = list(family = "sans-serif", size = 12,color = "black"),
xaxis = list(title = "Final Results"), barmode = "stack",
yaxis = list(title = "Number of Students"))
| Dimension | highest withdrawal rate | highest failure rate |
|---|---|---|
| course | CCC(45%) | GGG(29%) |
| course_presentation | 2014J(34%) | 2013B(27%) |
| gender | M(32%) | M(22%) |
| highest_education | No Formal Education(43%) | No Formal Education(28%) |
| imd_band | 0-10%(37%) | 0-10%(28%) |
| age_band | 0-35(32%) | 0-35(23%) |
| disability | Y(40%) | Y(23%) |
Answer 2: Based on the results above, summarised in the table, the ‘course’ dimension houses the highest withdrawal and failure rates!
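The per-dimension rates in the table above can be reproduced with the same group/summarise pattern used earlier, applied to each dimension in turn. A hedged sketch on toy data (the real computation would group student_info by the dimension of interest):

```r
library(dplyr)

# Toy stand-in for student_info
toy <- data.frame(
  disability   = c("Y", "Y", "N", "N", "N"),
  final_result = c("Withdrawn", "Fail", "Withdrawn", "Pass", "Pass")
)

# Share of each final_result within each level of the dimension
dim_rates <- toy %>%
  group_by(disability, final_result) %>%
  summarise(n = n(), .groups = "drop_last") %>%
  mutate(pct = round(100 * n / sum(n), 2))

dim_rates  # e.g. 50% of disability == "Y" students withdrew in this toy frame
```

Repeating this for gender, highest_education, imd_band, etc. fills in the table row by row.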
Now we’ll look at the assessments data. The calculations below show the number of assessments in each module, broken down by the three assessment types.
#courses
courses <- fread(paste(path, 'data/',"courses.csv",sep = ''),
stringsAsFactors = TRUE) %>%
mutate(module_presentation_length = as.integer(as.character(module_presentation_length)))
#assessments
assessments <- fread(paste(path, 'data/',"assessments.csv",sep = ''),
stringsAsFactors = TRUE, na.strings = c('','?')) %>%
dplyr::mutate(id_assessment = as.factor(id_assessment),
weight = as.integer(as.character(weight)),
submission_date = as.integer(as.character(date))) %>%
select(-c(date))
#join courses + assessments
course_assess <- left_join(courses, assessments) %>%
group_by(code_module, assessment_type) %>%
dplyr::summarise(assessment_count = length(id_assessment))
## Joining, by = c("code_module", "code_presentation")
#visualizing!
plot_ly(course_assess,
x = ~code_module, y = ~assessment_count, type = 'bar', hoverinfo = 'text',
color = ~assessment_type,
text = ~paste(course_assess$assessment_count, '', course_assess$assessment_type)) %>%
layout(title = "Number of Assessments per Module",
font = list(family = "sans-serif", size = 12,color = "black"),
yaxis = list(title = ""), barmode = 'stack')
The assessments data gives us an important performance metric: score.
#studentAssessment
student_assess <- fread(paste(path, 'data/',"studentAssessment.csv", sep = ''),
stringsAsFactors = TRUE, na.strings = c('','?')) %>%
mutate(id_assessment = as.factor(id_assessment),
id_student = as.factor(id_student),
is_banked = as.factor(is_banked),
date_submitted = as.integer(as.character(date_submitted)),
score = as.integer(as.character(score)))
#loading again for new analysis
course_assess <- left_join(courses, assessments)
## Joining, by = c("code_module", "code_presentation")
#joining course_assess + student assess
course_ave_assess <- merge(course_assess, student_assess) %>%
group_by(code_module, code_presentation, id_assessment) %>%
dplyr::summarise(ave_score = mean(score, na.rm = T))
#viz
plot_ly(course_ave_assess,
x = ~code_module, y = ~ave_score, type = 'bar', hoverinfo = 'text',
color = ~code_presentation,
text = ~paste(course_ave_assess$ave_score, '', course_ave_assess$code_presentation)) %>%
layout(title = "Average scores across all the modules and presentations",
font = list(family = "sans-serif", size = 12,color = "black"),
yaxis = list(title = ""), barmode = 'group')
From the plot above, student performance is similar across all courses and presentations. I dig deeper into this in the feature engineering part.
Next, we’ll look at the learning behaviour data from a module and presentation perspective. Note: student_vle is a monstrous file with roughly 10 million interaction records covering 6268 learning materials and 26074 students!
#studentVle
student_vle <- fread(paste(path, 'data/', "studentVle.csv", sep = ''),showProgress = F,
stringsAsFactors = TRUE, na.strings = c('','?')) %>%
mutate(id_student = as.factor(id_student),
id_site = as.factor(id_site),
sum_click = as.integer(as.character(sum_click)),
interaction_date = as.integer(as.character(date))) %>%
select(-c(date))
#calculating number of sites per module and presentation
stud_learn <- student_vle %>%
group_by(code_module, code_presentation) %>%
dplyr::summarise(num_sites = length(unique(id_site)))
#viz
plot_ly(stud_learn,
x = ~code_module, y = ~num_sites, type = 'bar', hoverinfo = 'text',
color = ~code_presentation,
text = ~paste(stud_learn$num_sites, '', stud_learn$code_presentation)) %>%
layout(title = "Amount of learning materials per module and presentation",
font = list(family = "sans-serif", size = 12,color = "black"),
yaxis = list(title = ""), barmode = 'stack')
The exploratory data analysis comes to an end. Moving on, I’ll discuss the core part of this analysis. So far, we’ve seen the demographic, performance and behaviour data provided by the OU team, along with its basic statistics.
I’m updating the goal as we learn insights from the data. So,
Expected Result: Reduction in failure and withdrawal rates across all modules by analysing demographic, performance and behaviour patterns.
This step is crucial as we approach the expected result, and I’m making a few assumptions based on the available feature set. It consists of two parts,
1. Subset creation
Question3: Why do we create subsets?
Answer 3: Occam’s principle (simple, small and sample).
In our business case, we have comprehensive data on 28785 students with 43 corresponding features. Extracting useful information for feature modeling means joining every table, and those multiple joins would explode the number of rows beyond one machine’s capability. Even if a beastly machine permitted it, building models on the complete dataset during testing is ill-advised, as we might miss granular information in the process. So: start simple, small and sample!
I’ve created the subset based on our initial analysis of finding the dimension that contributes the most withdrawals across all modules.
code_module: CCC, code_presentation: 2014B
Now, we’ve one course from one period. We can concentrate more on the core analysis than hassling with joins across multiple modules!
2. Feature Extraction
From the initial analysis, I’m splitting the problem statement in two:
Expected Result (Fail) and Expected Result (Withdrawal). This is because we need to analyze students in two buckets: either they withdrew or not, or they failed or not.
We’re proceeding with Withdrawal rate model to build a target feature as,
1 - Withdrew student
0 - Failed, Passed, Passed with distinction student
Question: Why this target feature?
Answer: It cleanly separates the candidates of interest (withdrawals) from everyone else, framing the problem as binary classification.
I’m using assessment and vle data initially to extract few metrics and use demographic data later in the process.
Loading all required datasets for the subset: courses, assessment, student_assessment, student_reg, vle, student_vle
#courses(No use as of now)
courses <- fread(paste(path, 'data/',"courses.csv",sep = ''),
stringsAsFactors = TRUE) %>%
mutate(module_presentation_length = as.integer(as.character(module_presentation_length))) %>%
filter(code_module == "CCC" & code_presentation == "2014B") %>%
droplevels()
#assessments
assessments <- fread(paste(path, 'data/',"assessments.csv",sep = ''),
stringsAsFactors = TRUE, na.strings = c('','?')) %>%
dplyr::mutate(id_assessment = as.factor(id_assessment),
weight = as.integer(as.character(weight)),
submission_date = as.integer(as.character(date))) %>%
select(-c(date)) %>%
filter(code_module == "CCC" & code_presentation == "2014B") %>%
droplevels()
#join courses + assessments - Join1
course_assess <- left_join(courses, assessments)
#students information
student_info <- fread(paste(path,'data/',"studentInfo.csv",sep = ''),
stringsAsFactors = TRUE) %>%
mutate(id_student = as.factor(id_student),
imd_band = as.factor(imd_band),
num_of_prev_attempts = as.integer(as.character(num_of_prev_attempts)),
studied_credits = as.integer(as.character(studied_credits)),
final_result = factor(ifelse(final_result == "Withdrawn","1","0"))) %>%
select(c(code_module, code_presentation, id_student, studied_credits, final_result)) %>%
filter(code_module == "CCC" & code_presentation == "2014B") %>%
droplevels()
table(student_info$final_result)
##
## 0 1
## 1038 898
Quick stats: total #students: 1936; #withdrew: 898; #completed: 1038; churn_rate = 898/1936 = 46.38%
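The churn rate above follows directly from the class table printed by table(student_info$final_result):

```r
# Class counts from the table above: 0 = completed/failed, 1 = withdrew
res <- c(`0` = 1038, `1` = 898)

churn_rate <- round(100 * res[["1"]] / sum(res), 2)
churn_rate  # 46.38 (% of students who withdrew)
```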
#studentAssessment
student_assess <- fread(paste(path, 'data/',"studentAssessment.csv", sep = ''),
stringsAsFactors = TRUE, na.strings = c('','?')) %>%
mutate(id_assessment = as.factor(id_assessment),
id_student = as.factor(id_student),
is_banked = as.factor(is_banked),
date_submitted = as.integer(as.character(date_submitted)),
score = as.integer(as.character(score)))
#test join -> cour_ass + student_info(success)
course_assess_info <- left_join(course_assess, student_info)
#join student_ass now
stud_course_assess_info <- left_join(course_assess_info, student_assess) %>%
mutate(id_student = factor(id_student),
id_assessment = factor(id_assessment))
#joining student_reg as I need to know when most of the people left, to mark a vline
student_reg <- fread(paste(path,'data/',"studentRegistration.csv",sep = ''),
stringsAsFactors = TRUE, na.strings = c('','?')) %>%
mutate(id_student = as.factor(id_student),
date_registration = as.integer(as.character(date_registration)),
date_unregistration = as.integer(as.character(date_unregistration))) %>%
select(-c(date_registration))%>%
filter(code_module == "CCC" & code_presentation == "2014B") %>%
droplevels()
#join stu_cours_ass_info with student_reg
stud_course_assess_info <- left_join(stud_course_assess_info, student_reg)
#student-stats
student_perform <- stud_course_assess_info %>%
group_by(id_student, id_assessment, final_result) %>%
dplyr::summarise(sc = sum(score, na.rm = T),
date_unregistration = date_unregistration,
submission_date = ifelse(is.na(submission_date),max(courses$module_presentation_length),submission_date)) %>%
filter(!(id_assessment == '40087')) #removing NA assessment
The final dataframe, ‘student_perform’, holds the score and submission_date for each assessment a student has taken. We use this data to visualize student performance over time!
Question: Why score and submission_date as metrics?
Answer: score captures a student’s performance. For the second dimension against which to project the scores, I chose submission_date, because date_submitted from the ‘student_assess’ table has a lot of missing values and I insist on data quality.
We chose 4 random students to analyse their performance,
Completed = 1038161, 105523
Withdrew = 1951080, 627711
stud_perf_viz <- subset(student_perform, id_student == '1951080' |
id_student == '627711' |
id_student == '1038161'|
id_student == '105523') %>%
arrange(submission_date)
#beautiful visuals!! major for dealing with performance model
ggplot(data = stud_perf_viz, aes(y=sc, x=submission_date,
color = id_student, group = id_student)) +
geom_line() +
stat_smooth(geom="line", color="green") +
facet_wrap(~id_student, scales = "free") +
theme_bw()
The last plots are a good visual interpretation of student performance over time. My first idea was to capture the rate of change (ROC) of scores over time as a metric for performance.
Challenge in using ROC as a predictor: decoding per-student trend rules for all 1936 students within the given time is very tough.
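To make the rate-of-change idea concrete, here is a hedged sketch that fits a per-student linear trend of score against submission date and takes the slope as the metric. The toy frame stands in for student_perform; column names mirror the ones used above.

```r
# Toy stand-in for student_perform: one improving and one declining student
toy_perf <- data.frame(
  id_student      = factor(rep(c("s1", "s2"), each = 3)),
  submission_date = rep(c(30, 60, 90), times = 2),
  sc              = c(50, 60, 70,   80, 70, 60)
)

# Slope of score over time per student: positive = improving, negative = declining
slopes <- sapply(split(toy_perf, toy_perf$id_student), function(d) {
  coef(lm(sc ~ submission_date, data = d))[["submission_date"]]
})

slopes  # s1 trends up, s2 trends down
```

The difficulty noted above is real: with few, unevenly spaced assessments per student, a single slope is a noisy summary.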
Not giving up!! I’ve decided to dig further to include learning behaviour information too.
Introducing a new timeline: ‘weeks’ instead of days. Why? Answer: interaction days are scattered according to each individual’s usage behaviour. To capture this, I bin time into weeks.
#student_vle
student_vle <- fread(paste(path, 'data/', "studentVle.csv", sep = ''),showProgress = F,
stringsAsFactors = TRUE, na.strings = c('','?')) %>%
mutate(id_student = as.factor(id_student),
id_site = as.factor(id_site),
sum_click = as.integer(as.character(sum_click)),
interaction_date = as.integer(as.character(date))) %>%
select(-c(date)) %>%
filter(code_module == "CCC" & code_presentation == "2014B") %>%
dplyr::mutate(weeks = factor(round(interaction_date/7))) %>% #weeks conversion
droplevels()
#vle
vle <- fread(paste(path, 'data/',"vle.csv",sep = ''),
stringsAsFactors = TRUE, na.strings = c('','?')) %>%
mutate(id_site = as.factor(id_site)) %>%
select(-c(week_from, week_to)) %>% #removing week_from/week_to, too much NA
filter(code_module == "CCC" & code_presentation == "2014B")
#joining vle + student_vle
stud_vle <- left_join(vle, student_vle) %>%
select(-c(interaction_date))
#VLEs + Performance
final <- left_join(stud_course_assess_info, stud_vle) %>%
dplyr::mutate(date_unregistration = factor(round(date_unregistration/7))) %>%
select(-c(code_module, code_presentation, module_presentation_length))
### find #assessments, vle_clicks/assessment, ave/sum scores
#summarized weekly metric
stud_vle_with <- final %>%
dplyr::group_by(id_student, weeks, final_result) %>%
dplyr::summarise(clicks = sum(sum_click, na.rm = T),
sc = mean(score, na.rm = T)) %>%
dplyr::mutate(sc = ifelse(is.na(sc),0,sc))
#plot the same 4 students
stud_behaviour <- subset(stud_vle_with, id_student == '1951080' |
id_student == '627711' |
id_student == '1038161'|
id_student == '105523') %>%
arrange(weeks)
#beautiful visuals!! major for dealing with performance model
ggplot(data = stud_behaviour, aes(y=clicks, x=weeks,
color = id_student, group = id_student)) +
geom_line() +
stat_smooth(geom="line", color="green") +
# geom_vline(xintercept = (stud_1038161$date_unregistration[1]),
# linetype="dashed", color = "red", size=1) +
facet_wrap(~id_student, scales = "free") +
theme_bw()
The plots above clearly show the learning behaviour of 2 withdrawn students and 2 completed students. The relative change in learning behaviour is very tough to deduce.
Challenges:
The total number of clicks/visits is not directly proportional to a student’s performance. Example: one student’s clicking/visiting behaviour may look ambiguous because they revisit the material repeatedly to understand it clearly; conversely, another student might use the material only once or twice, yet manage to pass the course.
A student may or may not access the material at all, because they might already know the content from other sources.
For these reasons, I came up with the churn predictor metrics that follow. Combining all the information, I’m building the critical features for predictive modeling.
Question: Why do we need new features?
Answer: No existing variable directly captures the joint learning pattern and performance of a student who is likely to withdraw or fail!
Metrics:
ave_score_per_student: averages all scores across every assessment; this captures a student’s overall performance.
ave_clicks_per_week: gives the weekly interaction, i.e. learning, pattern of a student.
num_weeks_interacting: counts the weeks in which a student used/visited the learning material. This feature turns out to have the best predictive power of any variable!
#final_week_metric dataframe
final_week_metric_data <- stud_vle_with %>%
dplyr::group_by(id_student, final_result) %>%
dplyr::summarise(ave_clicks_week = mean(clicks, na.rm = T),
num_weeks = length(weeks),
ave_sc = mean(sc))
Question: How did you validate these features?
Answer: Given the features’ nature (a numeric metric against a binary class), a one-way ANOVA is a good fit.
#check distributions, balance, anova, model
#model
a <- aov(ave_clicks_week ~ final_result, data = final_week_metric_data) #not correlated
b <- aov(num_weeks ~ final_result, data = final_week_metric_data)
c <- aov(ave_sc ~ final_result, data = final_week_metric_data)
summary(a)
## Df Sum Sq Mean Sq F value Pr(>F)
## final_result 1 29860243 29860243 187.5 <2e-16 ***
## Residuals 1934 308020279 159266
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(b)
## Df Sum Sq Mean Sq F value Pr(>F)
## final_result 1 155332 155332 1727 <2e-16 ***
## Residuals 1934 173917 90
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(c)
## Df Sum Sq Mean Sq F value Pr(>F)
## final_result 1 819706 819706 1048 <2e-16 ***
## Residuals 1934 1512035 782
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Performing an F-test to compare the variances of the metrics
var.test(final_week_metric_data$ave_clicks_week, final_week_metric_data$ave_sc)
##
## F test to compare two variances
##
## data: final_week_metric_data$ave_clicks_week and final_week_metric_data$ave_sc
## F = 144.9, num df = 1935, denom df = 1935, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 132.5471 158.4146
## sample estimates:
## ratio of variances
## 144.9048
Performing a Kolmogorov-Smirnov test to compare the metric distributions
ks.test(final_week_metric_data$ave_clicks_week, final_week_metric_data$ave_sc)
##
## Two-sample Kolmogorov-Smirnov test
##
## data: final_week_metric_data$ave_clicks_week and final_week_metric_data$ave_sc
## D = 0.79959, p-value < 2.2e-16
## alternative hypothesis: two-sided
Correlation check
cor(final_week_metric_data$ave_sc, final_week_metric_data$num_weeks)
## [1] 0.8078528
All 3 candidate metrics look good under the statistical tests above, though note that ave_sc and num_weeks are strongly correlated (r ≈ 0.81). We’ll see how they perform from a modeling perspective.
Finally, I’m combining demographics to build a model with the complete set of features. So,
#students information
student_info <- fread(paste(path,'data/',"studentInfo.csv",sep = ''),
stringsAsFactors = TRUE) %>%
mutate(id_student = as.factor(id_student),
imd_band = as.factor(imd_band),
num_of_prev_attempts = as.integer(as.character(num_of_prev_attempts)),
studied_credits = as.integer(as.character(studied_credits)),
final_result = factor(ifelse(final_result == "Withdrawn","1","0")))%>%
filter(code_module == "CCC" & code_presentation == "2014B") %>%
droplevels()
#Joining student_info
final_info <- left_join(final_week_metric_data, student_info, by=c("id_student", "final_result"))
names(final_info)
## [1] "id_student" "final_result" "ave_clicks_week"
## [4] "num_weeks" "ave_sc" "code_module"
## [7] "code_presentation" "gender" "region"
## [10] "highest_education" "imd_band" "age_band"
## [13] "num_of_prev_attempts" "studied_credits" "disability"
#removing unwanted features to make it more general!
final_df <- subset(final_info, select = -c(id_student, code_module, code_presentation)) #main dataset
Checking feature importance to select the final model with the most predictive power and better business decisions!
rf <- randomForest(final_result ~., data = final_df)
varImp(rf)
#Must change this based on the rf importance results
final_df <- subset(final_df, select = -c(gender, age_band, disability, num_of_prev_attempts))
data.table(final_df)
Dealing with bias:
Class balance: one advantage in our case is that the classes are fairly balanced (1038 vs 898), so we escape class-imbalance bias :)
Normalization/scaling: numerical features usually have to be scaled before being fed to a model. Fortunately, the chosen metrics work reasonably without scaling for the models used here (though distance-based KNN would normally benefit from it), limiting this source of performance bias.
Selection bias: this stems from the sampling decision to use only one course and presentation. Learning behaviour and performance look similar across modules in the available data, and the engineered per-student features do not depend on presentation timing or demographics, which mitigates, though does not eliminate, this bias. Check the results for more information.
Validation bias: to handle this, I split final_df into train and test samples so the classification models are validated on unseen data.
#Divide into training and test data
train <- sample(1:nrow(final_df),.80*nrow(final_df))
final_train <- final_df[train,]
final_test <- final_df[-train,]
As part of the modeling, I’ve used logistic regression, naive Bayes, KNN, random forest and SVM.
1. Logistic Regression(LR):
Why?
Answer: Logistic regression is one of the best classification models for binary cases due to its simplicity. It outputs class probabilities, which makes insights easy to glean. It also assumes a purely binary target, which our case satisfies.
#Apply LR model
glm.fit <- glm(final_result~., data = final_train, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = final_result ~ ., family = "binomial", data = final_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4302 -0.4742 -0.2118 0.5911 2.5963
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 1.6797719 0.4126741 4.070
## ave_clicks_week 0.0009760 0.0002148 4.543
## num_weeks -0.1533522 0.0106955 -14.338
## ave_sc -0.0099283 0.0033479 -2.966
## regionEast Midlands Region 0.7645534 0.3754659 2.036
## regionIreland 0.6102717 0.5100401 1.197
## regionLondon Region 0.4115512 0.3360025 1.225
## regionNorth Region 0.6173141 0.4240993 1.456
## regionNorth Western Region 0.5021486 0.3468796 1.448
## regionScotland -0.0034197 0.3062319 -0.011
## regionSouth East Region 0.2413689 0.3769149 0.640
## regionSouth Region 0.0677154 0.3382934 0.200
## regionSouth West Region 0.2731798 0.3827200 0.714
## regionWales -0.3544761 0.3894992 -0.910
## regionWest Midlands Region 0.2648782 0.3482371 0.761
## regionYorkshire Region 0.2641684 0.3872122 0.682
## highest_educationHE Qualification 0.0362989 0.1981586 0.183
## highest_educationLower Than A Level -0.0874126 0.1724694 -0.507
## highest_educationNo Formal quals 0.9620605 0.9129837 1.054
## highest_educationPost Graduate Qualification 0.7656949 0.5348219 1.432
## imd_band10-20 -0.3291759 0.3439622 -0.957
## imd_band20-30% -0.0387535 0.3273133 -0.118
## imd_band30-40% -0.1141795 0.3408490 -0.335
## imd_band40-50% -0.1991014 0.3490872 -0.570
## imd_band50-60% -0.0196513 0.3386028 -0.058
## imd_band60-70% 0.3319216 0.3631441 0.914
## imd_band70-80% 0.3545068 0.3505858 1.011
## imd_band80-90% 0.1103693 0.3452969 0.320
## imd_band90-100% -0.0983758 0.3656498 -0.269
## imd_band? -0.9100508 0.4924819 -1.848
## studied_credits 0.0040371 0.0017847 2.262
## Pr(>|z|)
## (Intercept) 4.69e-05 ***
## ave_clicks_week 5.53e-06 ***
## num_weeks < 2e-16 ***
## ave_sc 0.00302 **
## regionEast Midlands Region 0.04172 *
## regionIreland 0.23149
## regionLondon Region 0.22063
## regionNorth Region 0.14551
## regionNorth Western Region 0.14772
## regionScotland 0.99109
## regionSouth East Region 0.52193
## regionSouth Region 0.84135
## regionSouth West Region 0.47536
## regionWales 0.36278
## regionWest Midlands Region 0.44688
## regionYorkshire Region 0.49509
## highest_educationHE Qualification 0.85466
## highest_educationLower Than A Level 0.61227
## highest_educationNo Formal quals 0.29200
## highest_educationPost Graduate Qualification 0.15223
## imd_band10-20 0.33856
## imd_band20-30% 0.90575
## imd_band30-40% 0.73764
## imd_band40-50% 0.56844
## imd_band50-60% 0.95372
## imd_band60-70% 0.36071
## imd_band70-80% 0.31193
## imd_band80-90% 0.74924
## imd_band90-100% 0.78790
## imd_band? 0.06462 .
## studied_credits 0.02370 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2136.7 on 1547 degrees of freedom
## Residual deviance: 1229.8 on 1517 degrees of freedom
## AIC: 1291.8
##
## Number of Fisher Scoring iterations: 5
From the summary output above, the engineered metrics (ave_clicks_week, num_weeks, ave_sc) are highly significant, while most region, highest_education and imd_band levels are not. Now we classify the unseen test data to validate the logistic fit.
#predicting test data
glm.pred.test <- predict(glm.fit, newdata = final_test, type = "response")
#round to 0/1, then convert to a factor so confusionMatrix gets matching types
glm.pred.result <- factor(round(glm.pred.test), levels = levels(final_test$final_result))
caret::confusionMatrix(glm.pred.result, final_test$final_result, mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 167 28
## 1 37 156
##
## Accuracy : 0.8325
## 95% CI : (0.7915, 0.8683)
## No Information Rate : 0.5258
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6649
## Mcnemar's Test P-Value : 0.3211
##
## Sensitivity : 0.8186
## Specificity : 0.8478
## Pos Pred Value : 0.8564
## Neg Pred Value : 0.8083
## Precision : 0.8564
## Recall : 0.8186
## F1 : 0.8371
## Prevalence : 0.5258
## Detection Rate : 0.4304
## Detection Prevalence : 0.5026
## Balanced Accuracy : 0.8332
##
## 'Positive' Class : 0
##
Based on the confusion matrix, logistic regression gives a balanced accuracy of ~83%, better than expected. Removing the insignificant features might push accuracy higher, but pruning too aggressively risks overfitting (a danger for any model) and thus higher model bias! Let’s see how the other models perform.
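One way to test the pruning idea is to refit on only the strong predictors and compare AIC. This is a sketch on toy data; in the analysis, update() would be applied to glm.fit with the significant terms kept. Names x_strong/x_weak are hypothetical stand-ins.

```r
set.seed(145)

# Toy data: one informative predictor, one pure-noise predictor
toy <- data.frame(x_strong = rnorm(200), x_weak = rnorm(200))
toy$y <- rbinom(200, 1, plogis(2 * toy$x_strong))

full    <- glm(y ~ x_strong + x_weak, data = toy, family = "binomial")
reduced <- update(full, . ~ . - x_weak)  # drop the weak predictor, refit

# Lower AIC favors the simpler model when the dropped term carries no signal
c(full = AIC(full), reduced = AIC(reduced))
```

Whatever the AIC says, the reduced model should still be re-checked on the held-out test set before drawing conclusions.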
2. Naive Bayes (NB):
Naive Bayes is based on conditional probability. Its main advantage is that it exposes the prior and posterior probabilities of the features used in the model, and it is computationally very fast and easy to implement. It assumes independence between the predictors; the tests above suggest this holds only approximately here (ave_sc and num_weeks are correlated at r ≈ 0.81).
nb.fit <- naiveBayes(final_result~., data = final_train)
nb.pred <- predict(nb.fit, newdata = final_test, type="class")
caret::confusionMatrix(nb.pred, final_test$final_result, mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 168 40
## 1 36 144
##
## Accuracy : 0.8041
## 95% CI : (0.7611, 0.8425)
## No Information Rate : 0.5258
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6068
## Mcnemar's Test P-Value : 0.7308
##
## Sensitivity : 0.8235
## Specificity : 0.7826
## Pos Pred Value : 0.8077
## Neg Pred Value : 0.8000
## Precision : 0.8077
## Recall : 0.8235
## F1 : 0.8155
## Prevalence : 0.5258
## Detection Rate : 0.4330
## Detection Prevalence : 0.5361
## Balanced Accuracy : 0.8031
##
## 'Positive' Class : 0
##
Naive Bayes performs well, combining the calculated priors and conditionals into posterior probabilities, and gives a balanced accuracy of ~80%.
3. K-Nearest Neighbors (KNN):
K-nearest neighbors classifies a candidate based on its ‘k’ nearest neighbors. I chose this model because it can flag candidates whose pattern is closest to that of the withdrawn students. It is also easy to reason about; there is little black box behind the logic.
knn.fit <- knn3(final_result~., data = final_train, k=7)
knn.pred.result <- predict(knn.fit, newdata = final_test, type="class")
caret::confusionMatrix(knn.pred.result, final_test$final_result, mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 178 55
## 1 26 129
##
## Accuracy : 0.7912
## 95% CI : (0.7474, 0.8306)
## No Information Rate : 0.5258
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5781
## Mcnemar's Test P-Value : 0.001864
##
## Sensitivity : 0.8725
## Specificity : 0.7011
## Pos Pred Value : 0.7639
## Neg Pred Value : 0.8323
## Precision : 0.7639
## Recall : 0.8725
## F1 : 0.8146
## Prevalence : 0.5258
## Detection Rate : 0.4588
## Detection Prevalence : 0.6005
## Balanced Accuracy : 0.7868
##
## 'Positive' Class : 0
##
From the confusion matrix, the balanced accuracy is ~79%. This model has its own cons: the choice of ‘k’ greatly affects prediction accuracy, and the distance computations between neighboring points make it a computationally expensive model to choose.
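The choice of k can be made less arbitrary with cross-validation. A hedged sketch using caret::train on toy data (final_train would replace the toy frame in the real run; k = 7 above was fixed by hand):

```r
library(caret)

set.seed(145)
# Toy two-class data standing in for final_train
toy <- data.frame(x1 = rnorm(120), x2 = rnorm(120))
toy$y <- factor(ifelse(toy$x1 + toy$x2 + rnorm(120, sd = 0.5) > 0, "1", "0"))

# 5-fold CV over a small grid of odd k values
knn.tuned <- train(y ~ ., data = toy, method = "knn",
                   trControl = trainControl(method = "cv", number = 5),
                   tuneGrid = data.frame(k = c(3, 5, 7, 9, 11)))

knn.tuned$bestTune$k  # k with the best cross-validated accuracy
</imports>
```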
4. Random Forest (RF):
My favorite algorithm for any classification problem. Random forests are powerful: they cope with correlation when multiple factors are involved, handle missing values, and reduce variance through averaging. The model also doubles as a feature-importance tool, extending its usefulness beyond classification. Let’s see how random forest performs,
rf.fit <- randomForest::randomForest(final_result~., data = final_train, importance=TRUE, ntree=500)
rf.fit$importance
## 0 1 MeanDecreaseAccuracy
## ave_clicks_week 0.0403976617 0.019895148 0.0309080767
## num_weeks 0.1901859251 0.142756941 0.1681062322
## ave_sc 0.1077516327 0.058170842 0.0847789507
## region 0.0052364619 0.002343396 0.0038634183
## highest_education 0.0001193035 -0.001794374 -0.0007979793
## imd_band 0.0048412070 -0.002805559 0.0013068990
## studied_credits 0.0008148732 0.003404666 0.0020002345
## MeanDecreaseGini
## ave_clicks_week 114.87352
## num_weeks 253.21043
## ave_sc 169.29258
## region 79.58817
## highest_education 24.90103
## imd_band 71.44092
## studied_credits 33.17256
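As an aside, the same importance table can be read off a plot; varImpPlot is randomForest's built-in view of both measures (mean decrease in accuracy and Gini). A toy forest is used below so the snippet runs standalone.

```r
# Fit a small forest on synthetic data and plot both importance measures.
library(randomForest)
set.seed(145)
toy <- data.frame(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
toy$y <- factor(ifelse(toy$x1 + rnorm(300, sd = 0.5) > 0, "1", "0"))
fit <- randomForest(y ~ ., data = toy, importance = TRUE, ntree = 200)
varImpPlot(fit, main = "Variable importance")
```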
#variable importance
From the earlier variable importance checks, demographic factors like gender, age_band, and disability decrease the accuracy, and the table above shows highest_education adds nothing either (its mean decrease in accuracy is negative). We can remove such features while re-validating the models. Let's build the prediction,
rf.pred <- predict(rf.fit, newdata = final_test)
caret::confusionMatrix(rf.pred, final_test$final_result, mode="everything") #~90% specificity
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 164 19
## 1 40 165
##
## Accuracy : 0.8479
## 95% CI : (0.8083, 0.8822)
## No Information Rate : 0.5258
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6968
## Mcnemar's Test P-Value : 0.00922
##
## Sensitivity : 0.8039
## Specificity : 0.8967
## Pos Pred Value : 0.8962
## Neg Pred Value : 0.8049
## Precision : 0.8962
## Recall : 0.8039
## F1 : 0.8475
## Prevalence : 0.5258
## Detection Rate : 0.4227
## Detection Prevalence : 0.4716
## Balanced Accuracy : 0.8503
##
## 'Positive' Class : 0
##
The confusion matrix gives a balanced accuracy of ~85%.
5. Support Vector Machine (SVM):
This can be thought of as a more sophisticated relative of logistic regression, capable of drawing non-linear decision boundaries. The algorithm works well for high-dimensional data but needs a good amount of computation time to arrive at the final boundary, depending on the 'kernel' used.
svm.fit <- svm(final_result~., data = final_train)
summary(svm.fit)
##
## Call:
## svm(formula = final_result ~ ., data = final_train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.03225806
##
## Number of Support Vectors: 688
##
## ( 337 351 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm.pred <- predict(svm.fit, newdata = final_test)
caret::confusionMatrix(svm.pred, final_test$final_result, mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 165 25
## 1 39 159
##
## Accuracy : 0.8351
## 95% CI : (0.7943, 0.8706)
## No Information Rate : 0.5258
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6705
## Mcnemar's Test P-Value : 0.1042
##
## Sensitivity : 0.8088
## Specificity : 0.8641
## Pos Pred Value : 0.8684
## Neg Pred Value : 0.8030
## Precision : 0.8684
## Recall : 0.8088
## F1 : 0.8376
## Prevalence : 0.5258
## Detection Rate : 0.4253
## Detection Prevalence : 0.4897
## Balanced Accuracy : 0.8365
##
## 'Positive' Class : 0
##
It has a balanced accuracy of ~84%. Model selection:
Validation metric: maximize specificity, i.e. minimize false positives.
False positives: students who actually withdrew but were classified as completing.
True negatives: students correctly classified as withdrawing.
Specificity: computed as TN / (TN + FP), with '0' (completed) as the positive class, this gives the proportion of withdrawing students the model identifies correctly; it increases as the number of false positives (FP) decreases. As per the table below,
| Model | #False positives | Specificity |
|---|---|---|
| Logistic Regression | 30 | ~84% |
| Naive Bayes | 31 | ~84% |
| KNearest Neighbors | 47 | ~75% |
| Random Forests | 13 | ~93% |
| Support Vector Machines | 20 | 89% |
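The specificity figures can be recomputed by hand from any of the confusion matrices; here, the random forest matrix shown earlier (TN = 165, FP = 19):

```r
# Specificity = TN / (TN + FP), with "1" (withdrawn) as the negative class.
cm <- matrix(c(164, 40, 19, 165), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
TN <- cm["1", "1"]  # withdrawals correctly predicted (165)
FP <- cm["0", "1"]  # withdrawals predicted as completing (19)
round(TN / (TN + FP), 4)  # 0.8967, matching the specificity reported above
```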
I choose Random Forest, as it performs exceedingly well on the test data, yielding just 13 false positives and 93% specificity!
Subset validation: This validation is to justify my hypothesis, given the machine constraints.
Null hypothesis: the subset used for the analysis is biased and differs from the other subsets and the whole dataset. Alternate hypothesis: the subset used for the analysis is representative of the other subsets and the whole dataset.
Subset: course - "DDD", presentation - "2014J", #students - 1803, #withdrawn - 647, #completed - 1156
I re-ran the whole script on the above subset and obtained the following results.
| Model | #False positives | Specificity |
|---|---|---|
| Logistic Regression | 24 | ~80% |
| Naive Bayes | 28 | ~77% |
| KNearest Neighbors | 38 | ~68% |
| Random Forests | 18 | ~85% |
| Support Vector Machines | 27 | 78% |
Finally, my assumptions, engineered metrics, models, and statistics held up, with the random forest model again yielding the highest specificity (85%). Thus I reject the null hypothesis based on the validation results.
subset: course = "BBB", presentation = "2014B"
#courses(No use as of now)
courses <- fread(paste(path, 'data/',"courses.csv",sep = ''),
stringsAsFactors = TRUE) %>%
mutate(module_presentation_length = as.integer(as.character(module_presentation_length))) %>%
filter(code_module == "BBB" & code_presentation == "2014B") %>%
droplevels()
#assessments
assessments <- fread(paste(path, 'data/',"assessments.csv",sep = ''),
stringsAsFactors = TRUE, na.strings = c('','?')) %>%
dplyr::mutate(id_assessment = as.factor(id_assessment),
weight = as.integer(as.character(weight)),
submission_date = as.integer(as.character(date))) %>%
select(-c(date)) %>%
filter(code_module == "BBB" & code_presentation == "2014B") %>%
droplevels()
#join courses + assessments - Join1
course_assess <- left_join(courses, assessments)
## Joining, by = c("code_module", "code_presentation")
#students information
student_info <- fread(paste(path,'data/',"studentInfo.csv",sep = ''),
stringsAsFactors = TRUE) %>%
mutate(id_student = as.factor(id_student),
imd_band = as.factor(imd_band),
num_of_prev_attempts = as.integer(as.character(num_of_prev_attempts)),
studied_credits = as.integer(as.character(studied_credits))) %>%
select(c(code_module, code_presentation, id_student, studied_credits, final_result)) %>%
filter(!(final_result == "Withdrawn")) %>%
dplyr::mutate(final_result = factor(ifelse(final_result == "Fail","1","0"))) %>%
filter(code_module == "BBB" & code_presentation == "2014B") %>%
droplevels()
table(student_info$final_result)
##
## 0 1
## 727 396
Total #students: 1123, #Failed: 396, #Completed: 727, Churn rate: ~35%
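The quoted rate follows directly from the class counts above:

```r
# Failure ("churn") rate among students who did not withdraw.
failed <- 396; completed <- 727
round(failed / (failed + completed), 3)  # 0.353, i.e. ~35%
```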
#studentAssessment
student_assess <- fread(paste(path, 'data/',"studentAssessment.csv", sep = ''), showProgress = F,
stringsAsFactors = TRUE, na.strings = c('','?')) %>%
mutate(id_assessment = as.factor(id_assessment),
id_student = as.factor(id_student),
is_banked = as.factor(is_banked),
date_submitted = as.integer(as.character(date_submitted)),
score = as.integer(as.character(score)))
#test join -> cour_ass + student_info(success)
course_assess_info <- left_join(course_assess, student_info)
## Joining, by = c("code_module", "code_presentation")
#join student_ass now
stud_course_assess_info <- left_join(course_assess_info, student_assess) %>%
mutate(id_student = factor(id_student),
id_assessment = factor(id_assessment)) %>%
dplyr::mutate(score = ifelse(is.na(score),0,score))
## Joining, by = c("id_assessment", "id_student")
## Warning: Column `id_assessment` joining factors with different levels,
## coercing to character vector
## Warning: Column `id_student` joining factors with different levels,
## coercing to character vector
#joining student_reg as I need to know when most of the people left, to mark a vline
student_reg <- fread(paste(path,'data/',"studentRegistration.csv",sep = ''),
stringsAsFactors = TRUE, na.strings = c('','?')) %>%
mutate(id_student = as.factor(id_student),
date_registration = as.integer(as.character(date_registration)),
date_unregistration = as.integer(as.character(date_unregistration))) %>%
select(-c(date_registration))%>%
filter(code_module == "BBB" & code_presentation == "2014B") %>%
droplevels()
#join stu_cours_ass_info with student_reg
stud_course_assess_info <- left_join(stud_course_assess_info, student_reg)
## Joining, by = c("code_module", "code_presentation", "id_student")
## Warning: Column `id_student` joining factors with different levels,
## coercing to character vector
#student_vle
student_vle <- fread(paste(path, 'data/', "studentVle.csv", sep = ''),
stringsAsFactors = TRUE, na.strings = c('','?')) %>%
mutate(id_student = as.factor(id_student),
id_site = as.factor(id_site),
sum_click = as.integer(as.character(sum_click)),
interaction_date = as.integer(as.character(date))) %>%
select(-c(date)) %>%
filter(code_module == "BBB" & code_presentation == "2014B") %>%
dplyr::mutate(weeks = factor(round(interaction_date/7))) %>% #weeks conversion
droplevels()
#vle
vle <- fread(paste(path, 'data/',"vle.csv",sep = ''),
stringsAsFactors = TRUE, na.strings = c('','?')) %>%
mutate(id_site = as.factor(id_site)) %>%
select(-c(week_from, week_to)) %>% #removing week_from/week_to, too much NA
filter(code_module == "BBB" & code_presentation == "2014B")
#joining vle + student_vle
stud_vle <- left_join(vle, student_vle) %>%
select(-c(interaction_date))
## Joining, by = c("id_site", "code_module", "code_presentation")
## Warning: Column `id_site` joining factors with different levels, coercing
## to character vector
## Warning: Column `code_module` joining factors with different levels,
## coercing to character vector
## Warning: Column `code_presentation` joining factors with different levels,
## coercing to character vector
#VLEs + Performance
final <- left_join(stud_course_assess_info, stud_vle) %>%
dplyr::mutate(date_unregistration = factor(round(date_unregistration/7))) %>%
select(-c(code_module, code_presentation, module_presentation_length))
## Joining, by = c("code_module", "code_presentation", "id_student")
## Warning: Column `code_module` joining factor and character vector, coercing
## into character vector
## Warning: Column `code_presentation` joining factor and character vector,
## coercing into character vector
## Warning: Column `id_student` joining character vector and factor, coercing
## into character vector
Drawing on the experience of building the churn-rate prediction model, I tried the same approach for predicting failure, this time removing students who withdrew in order to minimize the effect of unwanted features. I've extracted two features for the failure model,
Average score: This is a critical and straightforward feature carrying the predictive power of a student's overall performance. The intuition goes back to my high-school days: my course coordinator used to calculate the running average after every assessment to see how students had performed so far and 'guess' whether a student might fail or pass. I applied the same concept, and the models took to it well.
Total VLE clicks: This feature is the total sum of clicks per student. Capturing the student's behaviour as a predictor gives the model good accuracy!
### find total vle_clicks, average_assessment scores
#summarized weekly metric
final_week_metric_data <- final %>%
dplyr::group_by(id_student, final_result) %>%
dplyr::summarise(vle_clicks = sum(sum_click, na.rm = T),
ave_sc = mean(score, na.rm = T)) %>%
dplyr::mutate(ave_sc = ifelse(is.na(ave_sc),0,ave_sc)) %>%
na.omit()
a <- aov(vle_clicks ~ final_result, data = final_week_metric_data) #one-way ANOVA: does the metric differ by final_result?
b <- aov(ave_sc ~ final_result, data = final_week_metric_data)
summary(a)
## Df Sum Sq Mean Sq F value Pr(>F)
## final_result 1 1.730e+10 1.730e+10 111.5 <2e-16 ***
## Residuals 1121 1.739e+11 1.552e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(b)
## Df Sum Sq Mean Sq F value Pr(>F)
## final_result 1 438560 438560 1720 <2e-16 ***
## Residuals 1121 285867 255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Performing an F-test to compare the variances of the two metrics
var.test(final_week_metric_data$ave_sc, final_week_metric_data$vle_clicks)
##
## F test to compare two variances
##
## data: final_week_metric_data$ave_sc and final_week_metric_data$vle_clicks
## F = 3.7881e-06, num df = 1122, denom df = 1122, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 3.369522e-06 4.258604e-06
## sample estimates:
## ratio of variances
## 3.788068e-06
Performing a Kolmogorov-Smirnov test to compare the distributions of the two metrics
ks.test(final_week_metric_data$ave_sc, final_week_metric_data$vle_clicks)
##
## Two-sample Kolmogorov-Smirnov test
##
## data: final_week_metric_data$ave_sc and final_week_metric_data$vle_clicks
## D = 0.94835, p-value < 2.2e-16
## alternative hypothesis: two-sided
Correlation check
cor(final_week_metric_data$ave_sc, final_week_metric_data$vle_clicks)
## [1] 0.3690335
#students information
student_info <- fread(paste(path,'data/',"studentInfo.csv",sep = ''),
stringsAsFactors = TRUE) %>%
mutate(id_student = as.factor(id_student),
imd_band = as.factor(imd_band),
num_of_prev_attempts = as.integer(as.character(num_of_prev_attempts)),
studied_credits = as.integer(as.character(studied_credits))) %>%
filter(!(final_result == "Withdrawn")) %>%
dplyr::mutate(final_result = factor(ifelse(final_result == "Fail","1","0"))) %>%
filter(code_module == "BBB" & code_presentation == "2014B") %>%
droplevels()
#Joining student_info
final_info <- left_join(final_week_metric_data, student_info, by=c("id_student", "final_result"))
## Warning: Column `id_student` joining character vector and factor, coercing
## into character vector
names(final_info)
## [1] "id_student" "final_result" "vle_clicks"
## [4] "ave_sc" "code_module" "code_presentation"
## [7] "gender" "region" "highest_education"
## [10] "imd_band" "age_band" "num_of_prev_attempts"
## [13] "studied_credits" "disability"
#removing unwanted features to make it more general!
final_df <- subset(final_info, select = -c(id_student, code_module, code_presentation)) #main dataset
names(final_df)
## [1] "final_result" "vle_clicks" "ave_sc"
## [4] "gender" "region" "highest_education"
## [7] "imd_band" "age_band" "num_of_prev_attempts"
## [10] "studied_credits" "disability"
#checking variable importance
rf <- randomForest(final_result ~., data = final_df)
rf$importance
## MeanDecreaseGini
## vle_clicks 116.215588
## ave_sc 274.690066
## gender 3.350762
## region 36.407346
## highest_education 10.978431
## imd_band 37.756609
## age_band 5.027555
## num_of_prev_attempts 6.941802
## studied_credits 10.584830
## disability 3.748279
final_df <- subset(final_df, select = -c(gender, age_band, disability))
data.table(final_df)
#Divide into training and test data
train <- sample(1:nrow(final_df),.80*nrow(final_df))
final_train <- final_df[train,]
final_test <- final_df[-train,]
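One hedged refinement worth noting: plain sample() ignores class balance, so a given split can over- or under-represent failures. caret (already loaded above) offers a stratified alternative; the sketch below demonstrates it on a synthetic outcome with the BBB/2014B class sizes, rather than on final_df itself.

```r
# Class-stratified 80/20 split: 0/1 proportions match across train and test.
library(caret)
set.seed(145)
y <- factor(rep(c("0", "1"), times = c(727, 396)))   # class sizes from the table above
idx <- createDataPartition(y, p = 0.80, list = FALSE)
round(prop.table(table(y[idx])), 3)   # train proportions
round(prop.table(table(y[-idx])), 3)  # test proportions, near-identical
```

In this document it would read `train <- createDataPartition(final_df$final_result, p = .80, list = FALSE)`.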
#Apply LR model
glm.fit <- glm(final_result~., data = final_train, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = final_result ~ ., family = "binomial", data = final_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8586 -0.4157 -0.1892 0.0903 3.3596
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 7.254e+00 8.341e-01 8.698
## vle_clicks -2.833e-05 1.925e-05 -1.472
## ave_sc -1.264e-01 1.031e-02 -12.258
## regionEast Midlands Region 2.407e-01 5.540e-01 0.435
## regionIreland 2.300e-01 9.089e-01 0.253
## regionLondon Region -4.165e-01 5.860e-01 -0.711
## regionNorth Region 1.461e+00 9.057e-01 1.613
## regionNorth Western Region 5.164e-01 6.057e-01 0.853
## regionScotland 8.615e-01 5.577e-01 1.545
## regionSouth East Region 1.586e-01 6.309e-01 0.251
## regionSouth Region 1.606e-01 5.674e-01 0.283
## regionSouth West Region 2.587e-01 6.407e-01 0.404
## regionWales -2.094e-01 5.325e-01 -0.393
## regionWest Midlands Region 1.053e-01 5.552e-01 0.190
## regionYorkshire Region -6.305e-03 6.384e-01 -0.010
## highest_educationHE Qualification -3.962e-01 4.551e-01 -0.870
## highest_educationLower Than A Level -7.397e-02 2.636e-01 -0.281
## highest_educationNo Formal quals -8.737e-01 9.169e-01 -0.953
## highest_educationPost Graduate Qualification -1.178e+01 1.566e+03 -0.008
## imd_band10-20 -2.544e-01 4.228e-01 -0.602
## imd_band20-30% -4.233e-01 4.332e-01 -0.977
## imd_band30-40% -7.405e-01 4.827e-01 -1.534
## imd_band40-50% -9.380e-01 5.236e-01 -1.791
## imd_band50-60% -2.050e+00 5.759e-01 -3.559
## imd_band60-70% -6.914e-01 5.812e-01 -1.190
## imd_band70-80% -4.899e-01 5.583e-01 -0.877
## imd_band80-90% -1.928e+00 6.591e-01 -2.925
## imd_band90-100% -1.575e+00 6.595e-01 -2.388
## imd_band? -1.533e+01 8.697e+02 -0.018
## num_of_prev_attempts 2.673e-01 2.250e-01 1.188
## studied_credits -8.998e-04 3.367e-03 -0.267
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## vle_clicks 0.141079
## ave_sc < 2e-16 ***
## regionEast Midlands Region 0.663915
## regionIreland 0.800222
## regionLondon Region 0.477276
## regionNorth Region 0.106802
## regionNorth Western Region 0.393934
## regionScotland 0.122365
## regionSouth East Region 0.801578
## regionSouth Region 0.777158
## regionSouth West Region 0.686396
## regionWales 0.694142
## regionWest Midlands Region 0.849551
## regionYorkshire Region 0.992120
## highest_educationHE Qualification 0.384073
## highest_educationLower Than A Level 0.779055
## highest_educationNo Formal quals 0.340639
## highest_educationPost Graduate Qualification 0.993995
## imd_band10-20 0.547290
## imd_band20-30% 0.328521
## imd_band30-40% 0.125023
## imd_band40-50% 0.073216 .
## imd_band50-60% 0.000372 ***
## imd_band60-70% 0.234155
## imd_band70-80% 0.380238
## imd_band80-90% 0.003447 **
## imd_band90-100% 0.016932 *
## imd_band? 0.985936
## num_of_prev_attempts 0.234832
## studied_credits 0.789287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1156.1 on 897 degrees of freedom
## Residual deviance: 462.0 on 867 degrees of freedom
## AIC: 524
##
## Number of Fisher Scoring iterations: 15
From the summary of the logistic regression for the failure model, we can see vle_clicks is far less significant than ave_sc.
glm.pred.test <- predict(glm.fit, newdata = final_test, type="response")
glm.pred.result <- factor(ifelse(glm.pred.test > 0.5, "1", "0"), levels = levels(final_test$final_result)) #confusionMatrix needs a factor, not a numeric vector
caret::confusionMatrix(glm.pred.result, final_test$final_result, mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 134 16
## 1 4 71
##
## Accuracy : 0.9111
## 95% CI : (0.8661, 0.9449)
## No Information Rate : 0.6133
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8077
## Mcnemar's Test P-Value : 0.01391
##
## Sensitivity : 0.9710
## Specificity : 0.8161
## Pos Pred Value : 0.8933
## Neg Pred Value : 0.9467
## Precision : 0.8933
## Recall : 0.9710
## F1 : 0.9306
## Prevalence : 0.6133
## Detection Rate : 0.5956
## Detection Prevalence : 0.6667
## Balanced Accuracy : 0.8936
##
## 'Positive' Class : 0
##
nb.fit <- naiveBayes(final_result~., data = final_train)
nb.pred <- predict(nb.fit, newdata = final_test, type="class")
caret::confusionMatrix(nb.pred, final_test$final_result, mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 125 13
## 1 13 74
##
## Accuracy : 0.8844
## 95% CI : (0.8353, 0.9231)
## No Information Rate : 0.6133
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7564
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9058
## Specificity : 0.8506
## Pos Pred Value : 0.9058
## Neg Pred Value : 0.8506
## Precision : 0.9058
## Recall : 0.9058
## F1 : 0.9058
## Prevalence : 0.6133
## Detection Rate : 0.5556
## Detection Prevalence : 0.6133
## Balanced Accuracy : 0.8782
##
## 'Positive' Class : 0
##
knn.fit <- knn3(final_result~., data = final_train, k=7)
knn.pred.result <- predict(knn.fit, newdata = final_test, type="class")
caret::confusionMatrix(knn.pred.result, final_test$final_result, mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 134 31
## 1 4 56
##
## Accuracy : 0.8444
## 95% CI : (0.7904, 0.8892)
## No Information Rate : 0.6133
## P-Value [Acc > NIR] : 2.924e-14
##
## Kappa : 0.6521
## Mcnemar's Test P-Value : 1.109e-05
##
## Sensitivity : 0.9710
## Specificity : 0.6437
## Pos Pred Value : 0.8121
## Neg Pred Value : 0.9333
## Precision : 0.8121
## Recall : 0.9710
## F1 : 0.8845
## Prevalence : 0.6133
## Detection Rate : 0.5956
## Detection Prevalence : 0.7333
## Balanced Accuracy : 0.8073
##
## 'Positive' Class : 0
##
rf.fit <- randomForest::randomForest(final_result~., data = final_train, importance=TRUE, ntree=500)
rf.fit$importance
## 0 1 MeanDecreaseAccuracy
## vle_clicks 0.0595395152 0.0236058631 4.703533e-02
## ave_sc 0.1847777082 0.2584070182 2.095743e-01
## region 0.0043112792 -0.0021063395 2.062941e-03
## highest_education -0.0006926918 -0.0022519847 -1.225296e-03
## imd_band 0.0042843401 0.0097772020 6.145308e-03
## num_of_prev_attempts 0.0040721285 0.0017996238 3.296530e-03
## studied_credits 0.0005242562 -0.0008927435 4.427896e-05
## MeanDecreaseGini
## vle_clicks 96.194694
## ave_sc 203.988276
## region 32.982259
## highest_education 9.792089
## imd_band 33.759248
## num_of_prev_attempts 9.074957
## studied_credits 9.513417
rf.pred <- predict(rf.fit, newdata = final_test)
caret::confusionMatrix(rf.pred, final_test$final_result, mode="everything") #~85% specificity
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 135 13
## 1 3 74
##
## Accuracy : 0.9289
## 95% CI : (0.8871, 0.9588)
## No Information Rate : 0.6133
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8468
## Mcnemar's Test P-Value : 0.02445
##
## Sensitivity : 0.9783
## Specificity : 0.8506
## Pos Pred Value : 0.9122
## Neg Pred Value : 0.9610
## Precision : 0.9122
## Recall : 0.9783
## F1 : 0.9441
## Prevalence : 0.6133
## Detection Rate : 0.6000
## Detection Prevalence : 0.6578
## Balanced Accuracy : 0.9144
##
## 'Positive' Class : 0
##
svm.fit <- svm(final_result~., data = final_train)
summary(svm.fit)
##
## Call:
## svm(formula = final_result ~ ., data = final_train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.03225806
##
## Number of Support Vectors: 307
##
## ( 156 151 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm.pred <- predict(svm.fit, newdata = final_test)
caret::confusionMatrix(svm.pred, final_test$final_result, mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 136 18
## 1 2 69
##
## Accuracy : 0.9111
## 95% CI : (0.8661, 0.9449)
## No Information Rate : 0.6133
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.806
## Mcnemar's Test P-Value : 0.0007962
##
## Sensitivity : 0.9855
## Specificity : 0.7931
## Pos Pred Value : 0.8831
## Neg Pred Value : 0.9718
## Precision : 0.8831
## Recall : 0.9855
## F1 : 0.9315
## Prevalence : 0.6133
## Detection Rate : 0.6044
## Detection Prevalence : 0.6844
## Balanced Accuracy : 0.8893
##
## 'Positive' Class : 0
##
Validation for failure rate prediction model:
| Model | #False positives | Specificity |
|---|---|---|
| Logistic Regression | 15 | ~81% |
| Naive Bayes | 9 | ~89% |
| KNearest Neighbors | 33 | ~60% |
| Random Forests | 11 | ~86% |
| Support Vector Machines | 19 | 77% |
Based on the overall results, I chose the NaiveBayes classifier, as it performs well with only 9 false positives (Type I errors) and ~89% specificity!
Churn Prediction: Based on my comprehensive analysis, I'm splitting my recommendations into two parts,
Business:
Break the timeline into weeks, which gives good variance for capturing students' churning behaviour. Predicting churn should draw on all three feature families: demographics, performance, and the behavioural patterns of the students.
Students won't tell you why they are quitting! So run A/B tests via web, email, ads, and webinars from professionals to see whether performance/behavioural interventions reduce the churn rate.
Create courses with more engaging content. Gather feedback from students showing poor performance, skipped assessments, or very low page visits/logins. This way we can learn what the customer wants!
Appoint special teams to look after students with a high probability of failing, gathering feedback via phone or e-mail, or through anonymous chat help desks.
Improve content quality by challenging students with team competitions and internal chat/messaging between students. This would close the gap between learning on an online platform and classroom study.
Reward students with virtual trophies and share student profiles across courses based on performance and continuous engagement (everyday login). This is gamification: we nudge customers with incentives, the same way games award points for logging in every day!
Technical:
The student churn prediction model must combine performance, learning-behaviour, and demographics data into compound metrics for effective prediction of churn and failure rates.
Break the timeline into weeks, which gives good variance for capturing students' churning behaviour.
Build alerting systems on top of the fitted models to monitor student performance over time.
Invest in extensive feature engineering toward a real-time prediction model that flags students at risk. This can be done by analyzing each student's pattern: rate of change, momentum, differences, relative change, rolling performance/behaviour metrics, etc.
From the existing model, we can move to non-linear multiplicative models rather than additive ones to attain greater performance.
We could create a compound metric capturing the logic for classifying multiple targets (pass, fail, or withdraw) in a single model. This would save a lot of time and be a great advantage when scaling into production systems too.
As the feature set grows, we can use more sophisticated approaches such as ensemble methods, bagging, and boosting over decision trees.
With enough data, we could build an intelligent system to regress the churn event itself, i.e., predict when a student would probably leave!
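The alerting idea above can be sketched in a few lines: score students with a fitted model and flag those whose predicted failure probability crosses a threshold. The 0.6 cutoff and the toy data below are illustrative assumptions, not part of the original analysis.

```r
# Flag at-risk students by predicted failure probability (toy example).
set.seed(145)
n <- 500
toy <- data.frame(ave_sc = runif(n, 0, 100),
                  vle_clicks = rpois(n, 300))
p_fail <- plogis(3 - 0.08 * toy$ave_sc - 0.002 * toy$vle_clicks)
toy$final_result <- factor(rbinom(n, 1, p_fail))
fit <- glm(final_result ~ ave_sc + vle_clicks, data = toy, family = binomial)
toy$risk <- predict(fit, type = "response")
at_risk <- toy[toy$risk > 0.6, ]  # students to route to the support team
nrow(at_risk)
```

In production the same scoring would run weekly over the rolling metrics, with the threshold tuned to the support team's capacity.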
Thanks for reading my data story. Share your thoughts!